Tool-level and hardware-level code optimization and respective hardware modification

ABSTRACT

The present invention relates to a method for compiling high-level software code into hardware, transforming each instruction into a respective hardware block and using an execution control signal representing the program pointer for triggering the execution within each respective hardware block.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 14/124,636, filed Jan. 13, 2014, which is a U.S. national phase of International Patent Application No. PCT/EP2012/002419, filed Jun. 6, 2012, which claims priority to European Patent Application No. 11 009 912.4, filed Dec. 16, 2011, European Patent Application No. 11 007 370.7, filed Sep. 9, 2011, European Patent Application No. 11 005 196.8, filed Jun. 27, 2011, and European Patent Application No. 11 004 667.9, filed Jun. 8, 2011, all of which are hereby incorporated by reference as if set forth in full in the application for all purposes.

PRIORITY

Priority is claimed to [1], [2], [3], [4], and [5].

REFERENCE TO COMPUTER PROGRAM LISTING APPENDIX AND INCORPORATION BY REFERENCE

A computer program listing appendix is included with this application on a compact disc (CD) (in duplicate). The entire contents of the program listing appendix are incorporated by reference into the patent for complete disclosure. The file name, date, and size in bytes for the files submitted on the compact disc are:

Program Listing 1: LISTING1.txt, May 9, 2017, 11 KB

Program Listing 2: LISTING2.txt, May 9, 2017, 14 KB

Introduction and Field of Invention

Tools for compiling high-level software code (e.g. C, C++, Fortran, etc.) to hardware are known in the prior art. For example, compilers for Handel-C (Celoxica) and Impulse-C (Impulse Accelerated Technologies) are known.

Those tools focus on transforming high-level code into hardware that is as optimal as possible, e.g. in terms of area, power dissipation and/or performance.

Those tools have in common that the high-level code has to be modified to be transformable. The tools require hints (pragmas) to guide the compiler and/or support only a subset of the high-level language, or are even rather different languages that merely use syntax similar to a known high-level language.

With smaller silicon geometries, area limitations are not particularly critical. With today's multi-core processor architectures, even performance becomes less critical. However, power aspects become increasingly important and a driving factor for compiling high-level software code to hardware. Simultaneously, time-to-market contradicts major code modifications as required in the state of the art.

This patent describes a method and hardware architecture which allow compiling high-level software code to hardware without major modifications.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is a diagrammatic illustration of an example reconfigurable hardware architecture including arithmetic logic units (ALUs);

FIG. 2 is a diagrammatic illustration of an example first pass of an optimizer algorithm;

FIG. 3 is a diagrammatic illustration of an example second pass of an optimizer algorithm;

FIG. 4 is a diagrammatic illustration of an example structure provided in a processor;

FIG. 5 is a diagrammatic illustration of an example hardware module generated by a hardware generator tool from source assembly code;

FIG. 6-1 is a diagrammatic illustration of an example implementation of a processor core;

FIG. 6-2 is a diagrammatic illustration of an example loop in code to be processed by a processor;

FIG. 6-3 is a diagrammatic illustration of an example detection of loop information and setup and issuance thereof to a loop control unit;

FIG. 6-4 is a diagrammatic illustration of example setup and issuance of instructions to load units;

FIG. 6-5 is a diagrammatic illustration of example setup and issuance of instructions to store units;

FIG. 6-6 is a diagrammatic illustration of example issuance of instructions to Arithmetic Logic Units (ALUs); and

FIG. 6-7 is a diagrammatic illustration of an example enhanced instruction set providing optimized instructions.

OVERVIEW AND INTRODUCTION

A standard high-level compiler might compile high-level code into assembly code without any modifications. Note: later in this patent, possible optimizations are described which may make the code produced by the compiler more efficient.

Each resulting assembly instruction is then compiled into a hardware block representing the instruction. The interface of the hardware block is defined by the source and target registers and possible status checking or generation of the assembly instruction.

In addition, each hardware block receives an execution signal (in_exec) for triggering operation and returns an execution signal (out_exec) once the operation has been performed and the result is produced.

A graph (code graph1.txt) representing the data transmission via the registers is generated by an analyser function (ANALYZER). The hardware blocks are chained together by connecting the outputs and inputs of the blocks as defined by the register accesses of the assembly code.

Each block could operate synchronously and comprise respective registers at the signal inputs and/or outputs.

However, in a preferred embodiment, the blocks operate asynchronously for optimizing cycle time. Registers are inserted, e.g., under at least one of the following exemplary conditions:

1. A block is complex and produces a long signal delay that reduces the target frequency. This could be e.g. complex calculations, multipliers, etc.

2. A block depends on external timing and/or has to synchronize with external data senders and/or receivers.

The signals in_exec and out_exec connect the blocks and trigger their operation. Thus, the in_exec/out_exec signal chain represents the program pointer of a real processor executing the assembly code.
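For illustration, the following is a minimal sketch, in Python, of how a generator tool might wrap one assembly instruction into such a hardware block with the in_exec/out_exec handshake, emitting Verilog-style text. All names (emit_block, blk_0042) are hypothetical and not part of the incorporated program listings.

def emit_block(name, opcode, srcs, dsts, width=32):
    # One input port per source register, one output port per target
    # register, plus the execution-chain signals in_exec/out_exec.
    ports = ["input in_exec", "output out_exec"]
    ports += ["input [%d:0] %s_i" % (width - 1, r) for r in srcs]
    ports += ["output [%d:0] %s_o" % (width - 1, r) for r in dsts]
    if opcode == "add":
        body = "assign %s_o = %s_i + %s_i;" % (dsts[0], srcs[0], srcs[1])
    elif opcode == "sub":
        body = "assign %s_o = %s_i - %s_i;" % (dsts[0], srcs[0], srcs[1])
    else:
        raise ValueError("no macro for " + opcode)
    # Asynchronous variant: out_exec follows in_exec combinationally;
    # a registered variant would insert a flip-flop here instead.
    return ("module %s(%s);\n  %s\n  assign out_exec = in_exec;\nendmodule"
            % (name, ", ".join(ports), body))

print(emit_block("blk_0042", "add", ["r1", "r2"], ["r0"]))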

Load/Store operations are implemented and stack accesses are performed as described in the assembly code. Therefore, even recursive code can be processed.

In one embodiment, the hardware function generated from the assembly code might use dedicated memories for holding data local to the hardware function. For example, preferably at least a local stack memory is implemented.

Blocks are grouped into modules. The contents of a module are defined by the execution graph of the assembly code. Some exemplary rules are listed below; a sketch of the resulting partitioning follows the list:

1. A jump target address defines the beginning of a module. Jump target addresses can be detected by labels in the assembly code and/or target addresses of jump instructions. A jump target address is any jump target of any instruction (e.g. jmp; call; mov pc,r . . . (in ARM assembly)).

2. Jump instructions define the end of a module (e.g. jmp; call; mov pc,r . . . (in ARM assembly)).

3. Saving the program counter (PC) into a link register (as e.g. known from ARM processors). If the PC is saved with an offset, the address PC+offset simultaneously defines the beginning of a new module.
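A minimal sketch of this grouping, under the simplifying assumption that each instruction is given as an (address, mnemonic) pair and that all jump targets have been collected beforehand, might look as follows in Python (the function and field names are illustrative only):

def split_into_modules(instrs, jump_targets):
    # instrs: list of (address, mnemonic) tuples in program order.
    # jump_targets: set of addresses gathered from labels and from the
    # target fields of jump/call instructions (rule 1).
    modules, current = [], []
    for addr, mnem in instrs:
        if addr in jump_targets and current:
            modules.append(current)   # rule 1: a jump target opens a new module
            current = []
        current.append((addr, mnem))
        op = mnem.split()[0]
        if op in ("jmp", "call", "b", "bl") or mnem.startswith("mov pc"):
            modules.append(current)   # rules 2 and 3: jumps/calls close a module
            current = []
    if current:
        modules.append(current)
    return modules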

Modules receive all register content according to the real processor's register file as input and send all registers back as output. This may include any status and/or processor control and/or processor supervising registers.

Modules are preferably synchronous and register the input and/or output information. Thus, each module represents the register file of the real processor: it receives data from the internally grouped blocks and provides the data to other modules, acting as the register file.

If the signal delay within a module becomes too long to meet the target frequency, pipeline stages might be inserted in the module.

Modules are combined in a main routine (main). A call graph (which may comprise additional information) is generated by analysing the program pointer modifications in the assembly language. The main routine respectively has an input multiplexer in front of each module for receiving the input data from all potentially previously executed modules potentially jumping to/addressing the respective module. The selection of each of the multiplexers is controlled by the exec signals received from the previously executed module addressing the respective module.

Reconfigurable Core

The shown algorithm performs efficiently for small algorithms and functions (such as e.g. a FIR filter or a DCT). However, more complex algorithms, such as e.g. a complete H.264 video decoder, will require an unacceptable amount of silicon area.

The solution to this problem is two-fold:

1. Using a reconfigurable hardware architecture which is in line with the processing model of the compiler algorithm. The hardware platform is reused over time for various configurations, each representing a part of the algorithm to be processed.

2. Partitioning the compiled code into configurations fitting and performing efficiently on the reconfigurable hardware.

FIG. 1 shows a respective exemplary reconfigurable architecture. The datapath block (DPB, 0101), comprising the ALUs and interconnection, is preferably implemented asynchronously. It comprises an amount of ALU and interconnection resources; typical examples are 4×4, 6×4, or 6×6 arrays of ALUs.

The ALUs are capable of performing functions according to the instruction set used by the compiler (or, vice versa, the instruction set of the compiler is in accordance with the functions provided by the ALUs). In one embodiment, the required capabilities of the ALUs are analysed by the compiler at compile time and the ALUs are respectively designed in hardware. In one particular embodiment, the ALUs within the datapath (0101) may be heterogeneous and offer different functionality. The respective functions are defined by analysing the target algorithms; e.g. the compiler (or analysis tools such as profilers) could define, on the basis of one or a set of target algorithms, which instructions are required, at which position(s) within the datapath and/or how frequently they are required (e.g. in only one ALU within the datapath, 2 ALUs or any other amount).
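As a hedged illustration only (this is not the incorporated tool-chain), such a compile-time analysis can be as simple as a mnemonic histogram over the target algorithms, from which the designer or generator derives which functions each ALU position must offer and which belong in a Complex Function Block:

from collections import Counter

def required_alu_functions(assembly_lines):
    # Count how often each mnemonic occurs; rare but expensive operations
    # (e.g. mul) are candidates for a single ALU or a Complex Function
    # Block, while frequent ones are provided in every ALU.
    histogram = Counter(line.split()[0] for line in assembly_lines if line.strip())
    return histogram.most_common()

print(required_alu_functions(["add r0, r1, r2", "mul r3, r0, r0", "add r4, r3, r1"]))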

Preferably, the elements (e.g. ALUs) within the datapath operate asynchronously, so that all datapath operations are performed in a single clock cycle and then the next configuration is executed or the same configuration is repeated (e.g. for processing loops) once again. However, in some embodiments at least some pipelining might be supported, e.g. each or some of the ALUs comprise operand and/or result registers. In some embodiments some ALUs may comprise operand registers, while others comprise result registers and again others have no registers at all. Yet in other embodiments, pipeline stages may be drawn into the datapath at several physical stages not related to ALUs, e.g. in between the first and second half of the datapath, or after each quarter. The registers might be positioned in the interconnect structure.

Operand data is received from a register file (0102) and the results are written back to the register file.

It is typical for configurable architectures that any element of a configurable array can send data to any other element. Particularly, there is no strict limitation in which direction result data is sent and from which direction operand data is received. Reference is made e.g. to Xilinx's Virtex-4, -5, -6, and -7 FPGA series, PACT's XPP architecture, IMEC's ADRES architecture and other academic projects such as KressArray, Pleiades, PADDI, and DPGAs. This unlimited degree of freedom leads to various problems, e.g. the undefined timing and operation termination characteristic, which even includes potential timing loops in non-registered signal paths.

In the preferred implementation, the datapath interconnection architecture of the ZZYX processors is implemented in its strictest variant, which is limited to a unidirectional top-to-bottom data transmission. Reference is made to [4], e.g. FIG. 22. Also reference is made to [1], e.g. FIGS. 4, 27 and 27a. [1] and [4] are entirely incorporated by reference into this patent for full disclosure; claims of this patent may comprise features of [1] and [4]. In one embodiment, data transmission in a unidirectional horizontal direction between elements (e.g. ALUs) is also permitted (see e.g. [4] FIG. 22, 2299). This limited capability of the datapath provides a variety of benefits: for example, the maximum signal delay is exactly predictable, no timing loops are possible, and the datapath complexity (and silicon area) is significantly reduced.

Particularly if the datapath operates asynchronously, it is important to reduce the ALUs' complexity to achieve an acceptable operating frequency. Therefore, complex time-consuming instructions (e.g. multiplications) and/or multi-cycle instructions (e.g. division) are implemented in a dedicated Complex Function Block (CFB) (0103) separated from the datapath. The CFB performs such function(s) preferably within the cycle time of the datapath (0101).

In one particular embodiment, the reconfigurable core might represent a standard processor or DSP, such as e.g. an ARM9, Intel x86 or a TI C64. The register file might represent exactly the registers as available in the respective processor. However, those register files are small, limiting the potential parallelism in the datapath. Therefore it is preferred to use a larger register file, e.g. providing 32 to 256 registers depending on the amount of ALUs in the datapath. Analysis has shown that for typical datapath sizes (e.g. 4×4, 6×4, or 6×6 arrays of ALUs) register files of about 32 registers (±16) are reasonable.

In one implementation of the software tool-chain (e.g. C-compiler, linker, assembler, etc.), the compiler might already use a larger amount of registers and generate a binary incompatible with the standard processor and/or directly produce the configuration binary. In other implementations of the tool-chain, a configuration-generator (CG) might analyse the register use and dependencies and extend the amount of registers at that stage, e.g. by reassigning the registers or inserting virtual registers.
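Purely as a sketch of what such a configuration-generator pass might do (the representation is hypothetical), the following Python fragment renames every register definition to a fresh virtual register, thereby removing false dependencies and exposing the larger physical register file:

def extend_registers(instrs):
    # instrs: list of (dst, srcs) tuples using architectural register names.
    # Every redefinition of a register is mapped to a fresh virtual
    # register, eliminating write-after-write and write-after-read hazards.
    mapping, out = {}, []
    for i, (dst, srcs) in enumerate(instrs):
        renamed_srcs = [mapping.get(s, s) for s in srcs]
        mapping[dst] = "v%d" % i
        out.append((mapping[dst], renamed_srcs))
    return out

print(extend_registers([("r0", ["r1", "r2"]), ("r0", ["r0", "r3"]), ("r1", ["r0"])]))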

Typically the reconfigurable core comprises a Configuration Pointer (CP) or Configuration Counter (CC) register, which is the respective substitute for the processor's program pointer (PP) or program counter (PC). Within this patent we may use PC or PP, which are synonymous. The CC/CP (0104) points to the next configuration in a configuration memory (0105) to be configured and executed, indicated by 0106.

The configuration memory might be a non-volatile memory, such as a ROM or Flash-ROM. In other embodiments it might be a RAM, particularly an SRAM. The RAM might be preloaded from another instance. Alternatively the RAM might operate similar to a Level-1 instruction cache (quasi a Level-1 configuration cache) and receive configurations from higher memory instances in the memory hierarchy, e.g. a Level-2 and/or Level-3 cache, or ultimately the system memory.

It shall be expressly noted that the invention is applicable to ZZYX processor cores and may be used as a compiler or directly implemented in the processor hardware, e.g. as shown in FIG. 4.

Load/Store

In contrast to reconfigurable architectures in the prior art, Load/Store Operations (LSO) are managed separately from other operations. The reasons are manifold, for example: LSO, particularly load operations, require a different timing model than other operations (e.g. ADD, SUBtract, ShiFT), as the latency of the memory access must be taken into account. Further, LSO, particularly load operations, limit the achievable parallelism due to the bottleneck of accessing the memory.

Those reasons are some examples why in the prior art the programmer had to take specific measures to code and manage memory accesses and/or why they significantly reduced the performance of the reconfigurable architecture. In the following, an advanced architecture is described overcoming those limitations, in conjunction with a novel Optimizer for compilers and/or low-level tools to manage LSO more efficiently.

Particularly for timing reasons, LSO are not performed within the datapath, but in a dedicated separated Load/Store Block (LSB) (0107). The LSB has direct access to the register file.

Within the LSB a plurality of Load/Store Units (LSU) might be located, each LSU (0108) operating in parallel to the others on dedicated memory blocks (0109), supporting a plurality of data accesses in parallel. Some of the dedicated memories might be small scratch memories (e.g. Tightly-Coupled Memories (TCM)); some of them might be preloadable (so that e.g. constant data can be loaded); some of them might be loaded from and off-loaded to the memory hierarchy (0110) (e.g. for supporting context switching), while others are only scratch data which could be destroyed (e.g. in a context switch). Other dedicated memories might be Level-1 data caches exchanging data with the higher memory hierarchy (0110). Preferably each of the Level-1 caches operates in a dedicated and exclusive memory address range, so that no coherence problems exist. Reference is made to [1], [2], [4] and [5], which are all entirely incorporated by reference into this patent for full disclosure, and claims may comprise features of those references.

In a preferred embodiment the plurality of LSUs is not only capable of accessing different memory ranges in parallel, but at least some of the LSUs (preferably all of them) may support parallel accessing of data within a vicinity. This technology and its benefits are described in detail in [5].

Conditional Execution

Processors tend to use fine-granular conditional execution, which frequently disturbs the linear increment of the program pointer (PC/PP). Theoretically, a reconfigurable architecture could jump at the granularity of configurations (see e.g. [7] or [8]) or select parts of a configuration for execution (see e.g. [9]), if the tools are capable of generating respective code. Particularly for selecting parts of configurations for execution, it is required to replace the conditional jump operation by a conditional execution enable signal.

Some processors, e.g. ARM, provide one conditional execution field per instruction, which supports conditional execution based on one single state (e.g. one flag). More complex (e.g. nested) conditions are not supported. However, even in rather simple algorithms it is not sufficient to support only one single conditional level.

While processors, such as ARM, have to trade off between performance and a dense instruction set, configurable technologies tend to have slightly larger headroom and can afford some additional control bits for managing conditional execution in both the configuration data file and the actual hardware. Therefore, depending on the implementation, 8 to 16 conditional levels are supported in the inventive core; future implementations may even support more levels.
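Conceptually, the conditional control field can be viewed as one (flag, polarity) entry per active nesting level, ANDed into a single enable. The following Python sketch (the encoding is an assumption for illustration, not the actual hardware format) shows the evaluation for nested conditions:

def instruction_enabled(cond_levels, flags):
    # cond_levels: list of (flag_name, required_value) pairs, one per
    # active nesting level (8 to 16 levels in the inventive core).
    # flags: current status flags, e.g. {"Z": 1, "C": 0}.
    # The instruction executes only if every nesting level's condition
    # holds, i.e. the per-level enables are ANDed together.
    return all(flags[name] == value for name, value in cond_levels)

# An instruction nested in "if (Z) { if (!C) { ... } }":
print(instruction_enabled([("Z", 1), ("C", 0)], {"Z": 1, "C": 0}))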

Synchronisation

Configurable processors in the prior art either support no implicit synchronization (requiring explicit timing management by the software and/or programmer), as e.g. FPGAs, DPGAs and the like, or synchronization state machines in each element (making timing entirely transparent to the programmer and/or software but requiring additional hardware), as e.g. PACT XPP.

In the inventive embodiment, synchronization occurs on a configuration level driven by the clock signal. Configurations have to terminate and complete data processing within the granularity of a clock cycle. Typically a configuration has to complete and terminate within one single clock cycle. However, depending on the complexity of the datapath, in asynchronous implementations data may take longer to cross the datapath than a single clock cycle. It would certainly be possible to limit the datapath such that any operation will be completed within a clock cycle, and/or to extend the compiler tool chain such that it ensures that the generated code meets the timing requirements.

Clock

In a preferred embodiment the processor hardware and the compiler tool chain support (maybe additionally) configuring the synchronization depending on the execution time (signal delay time) within the datapath. In this case, at least for some configurations, the number of clock cycles to complete data processing is configurable. In typical implementations small ranges, such as 2 to 4 clock cycles, are sufficient, while for datapaths supporting complex operations (e.g. floating point) the number of clock cycles should be configurable on a far larger scale, e.g. up to 128 clock cycles.

Description of a Code Example

For further explanation, reference is made to the ARM Thumb assembly code (CODE1) in Appendix A of the present specification. This code is part of the x264 Encoder [6]. The listed code does not require much explanation, as it is obvious for a person skilled in the art that the lines start with the line number, followed by an instruction or a label. Instructions comprise a 32-bit binary displayed in hexadecimal notation. Although the Thumb instruction set is used, the binary is already displayed in standard ARM non-Thumb binary code. Constant values are indicated by .word. For more detailed information, reference is made to the ARM instruction set documentation.

The code in Appendix 1 (CODE1) is translated into a new format (CODE2) as shown in Program Listing 1 of the computer program listing appendix.

This translation may be performed at various levels, which may beclaimed separately.

In one embodiment, the translation is already performed at compile time. In this case, a translation is not necessarily performed; rather, the compiler back end may directly emit code according to CODE2. The respective optimization is then part of the compiler optimization passes and the backend generates the respective code. The optimization passes may operate on the compiler-internal data structures, such as e.g. the dependency graphs, DAGs, etc.

In another embodiment, the translation is performed in an intermediate step, analysing already compiled code and transforming it. This may happen in a separate tool run. Alternatively or additionally, the operating system may perform the translation, e.g. while installing the respective software, or while starting the respective software (e.g. by the loader). In yet another embodiment the translation is performed in hardware, e.g. within the micro-processor chip. As will be described, the translation may be performed at execution time. In one embodiment it is performed in front of and/or using an instruction cache and/or buffer structure.

In one embodiment, the optimization is done at different stages, for example:

At compile time, the ideal instruction dependency might be analysed and the instructions sorted respectively, particularly with respect to the load/store instructions (LSI).

Possibly at install and/or load time, e.g. for ideally adapting to the system's memory and/or IO architecture.

At runtime within the processor, e.g. for expanding the binary from the limited instruction set architecture (ISA) to the processor's hardware capabilities (e.g. a virtual register file).

The first characters of the CODE2 listing are for debugging purposes (<, −, >) and do not require detailed explanation. The line number is next, followed by a conditional control field comprising 16 levels of additional conditional execution levels. The respective instruction is executed depending on the conditional control settings and status flags provided by earlier executed instructions, such as compare instructions (cmp). It is followed by the binary and the instruction mnemonic. The register references are enhanced with dependency information: an underscore followed by the address of the register source, or, in case of a register target, followed by the current address. The mnemonic is enhanced with conditional and flag information. For each conditional level the flag source is provided, indicated by “f_” followed by the source's address. Instructions generating flags are enhanced with “c_” and the current address, indicating that the conditional information (flags) is provided.

Instructions using a register (e.g. as base address), modifying it and writing back a different value are enhanced with the full register access sequence: for example, push {r3, r4, r5, r6, r7, lr} uses the stack pointer (r13) as base address for the data transfer and returns the modified stack pointer to the register. Consequently the instruction is replaced by push /t13_007c, s13_0000/!, {s3_0000, s4_0000, s5_0000, s6_0000, s7_0000, s14_0000} for indicating all register transactions required for the optimization.

It shall be mentioned here that the reference _0000 indicates that the registers of the function (callee) are set prior to entering the function by the calling routine (caller).

The code (CODE2) is partitioned into boxes (BOX n). All instructions within a box can execute simultaneously. In this embodiment an asynchronous datapath is implemented, supporting a plurality of instructions, even dependent ones, to be issued in one clock cycle and executed in one clock cycle. Load/Store instructions cannot be executed asynchronously due to the rather long delays and/or the pipelined nature of the memory subsystem; e.g. even a Level-1 cache access may require a plurality of clock cycles. Therefore, dependent Load/Store instructions are placed in different boxes.

In one embodiment a box comprises a sequence of instructions which might execute concurrently and asynchronously. Instructions might depend on other instructions, e.g. a result of an instruction might feed operands of one or more subsequent instructions. Instructions within a box might execute conditionally, e.g. as known from the ARM instruction set (see the condition field of ARM instructions, e.g. [13]). A jump (or call) or conditional jump (or call) defines the end of a box and is usually the last instruction to be placed into a box. In another embodiment the instructions within the box must be independent, i.e. a result of an instruction might not feed operands of one or more subsequent instructions.

In one embodiment a jump target (instruction to be jumped to by a jump or conditional jump instruction) must be located at the very beginning of a box, i.e. be the first instruction. Only one jump target is supported per box.

In another embodiment multiple jump targets might be allowed within a box, which can even be placed at random positions. In this case the hardware must decode the instruction pointer PP delivered by the previous box initiating the jump into the box and omit all instructions prior to the instruction pointer PP.

In a preferred embodiment, all instructions of a box have the same timing, e.g. perform in a single clock cycle, two clock cycles, . . . n clock cycles. Also, instructions with the same timing model might be grouped, e.g. load instructions which have to wait for the incoming data, or store instructions which depend on the availability of write access to memory. Those boxes trigger all individual instructions once for a single execution; individual instructions might perform and terminate independently of others. The box is only left, i.e. execution continued with the next subsequent box, if each individual instruction has been performed and terminated.

A box is left with a pointer to the next box to be executed; preferably this pointer is the instruction pointer PP. For further details see FIG. 5.

Load/Store instructions (LSI) are grouped. In an ideal embodiment of the Load/Store Units (LSU), all or at least some of the Load/Store instructions (LSI) can be executed simultaneously. Theoretically all LSI within a box can be executed in parallel; however, the capabilities of the LSU may limit the achievable concurrency.

It is therefore preferred that a plurality of memories is implemented, each memory operating on a different address range. Reference is made to [1] and [2], describing respective memory hierarchies. It shall be noted again that these patents are entirely incorporated by reference for full disclosure and respective claims may use features described in them.

In an even more preferred embodiment a plurality of addresses is accessible simultaneously within at least some of the single memories. Reference is made to [5], describing a respective memory. It shall be noted again that this patent is entirely incorporated by reference for full disclosure and respective claims may use features described in it.

Load/Store instructions (LSI) and other datapath instructions may be placed into the same box. In this case no dependency between the Load/Store instructions (LSI) and any other datapath instruction is allowed within the same box.

The Optimizer

Various implementations of a respective optimizer algorithm are possible. For example, the implementation may depend on the environment in which the algorithm is used.

An exemplary optimizer algorithm is described which operates on non-optimized binaries already generated by a compiler. The algorithm might be used in a separate tool optimizing code, or by an operating system optimizing code while loading or installing it.

A slightly modified embodiment might be implemented in hardware, e.g. in a micro-processor.

Other optimizer algorithms might operate on the Data-Flow-Graph (DFG) and Control-Flow-Graph (CFG) within a compiler and use dependency graphs and analysis for scheduling the instructions.

The exemplary algorithm operates in two passes:

The first pass is outlined in FIG. 2. The optimizer pass (OP1) moves linearly through the binary, from the first to the last instruction. Each instruction is analyzed (0201). A sketch of this pass follows the list below.

OP1-1. If the instruction is a jump instruction (0202), the type of jump instruction is checked.

a. If it is a conditional and optimizable instruction (0203), it is checked whether the jump points backwards (0204) in the code to an earlier instruction.

i. In this case a loop is detected and respectively marked (0205).

ii. If not, a conditional code section has been detected. The jump instruction is removed from the code, but a respective entry with all information of the jump instruction, including the condition for its execution, is put (0206) onto a jump-stack (jmpstk) comprising all information of conditional and optimizable jump instructions in the order of their occurrence. Additionally, flag dependency data is added to the jmpstk, indicating on which flag the conditional jump depends and which instruction generates the respective flag.

b. If it is a non-conditional or non-optimizable instruction (0207) (in this simple exemplary embodiment of the optimizer), all jumps still present on the jmpstk (see below) are written back into the code (0208) in stack order, i.e. from top to bottom.

OP1-2. If the instruction is not a jump instruction (0212):

a. It is checked whether the address is a target address of a jump instruction by evaluating the jmpstk.

i. If so, the respective entry/entries is/are removed from the jmpstk.

b. The jmpstk is evaluated and conditional control information is added (0215). As conditional jump instructions might be removed from the code in 0206, it is necessary to add the respective conditional execution control code to the instruction.

c. Flag dependency information is collected from the jmpstk and added to the instruction (0216).

d. Registers are renamed (0217).
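The following Python sketch mirrors the first pass on a strongly simplified instruction representation. The reference implementation is the Perl program in Program Listing 2; the dictionary fields used here (addr, op, target, cond) are hypothetical stand-ins, and register renaming (0217) is omitted:

def optimizer_pass1(code):
    jmpstk, out = [], []
    for ins in code:
        if ins["op"] == "jump":
            if ins.get("cond") and ins["target"] <= ins["addr"]:
                ins["loop"] = True                # OP1-1.a.i: backward jump = loop
                out.append(ins)
            elif ins.get("cond"):
                jmpstk.append(ins)                # OP1-1.a.ii: defer onto the jmpstk
            else:
                out.extend(reversed(jmpstk))      # OP1-1.b: write jumps back,
                jmpstk.clear()                    # top of stack first
                out.append(ins)
        else:
            # OP1-2.a: drop jmpstk entries whose target address is reached
            jmpstk[:] = [j for j in jmpstk if j["target"] != ins["addr"]]
            # OP1-2.b/c: attach conditional control and flag dependencies
            ins["cond_ctrl"] = [(j["cond"], j["addr"]) for j in jmpstk]
            out.append(ins)
    return out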

The second pass is shown in FIG. 3. The optimizer pass (OP2) moves linearly through the enhanced binary data produced by the first pass. Each instruction is analyzed (0301). A sketch of this pass follows the list below.

OP2-1. If the current instruction is not a jump instruction (0302), the latest source of the instruction is detected based on the register reference information generated by the first pass. The information of the latest (latest in terms of most recently produced) source is retrieved.

a. If the current instruction is a store instruction (0304), it is placed into the next subsequent box after the box in which the latest source instruction is placed (0305); else

i. If the current instruction's latest source is a non-load instruction (0306), the current instruction can be placed into the same box as the source instruction (0307). (However, if the current instruction is a multi-cycle instruction (such as e.g. multiplication, division, load, etc.), it is also preferably placed into the next subsequent box after the box in which the latest source instruction is placed (see dotted line 0307a).)

ii. If the current instruction's latest source is a load instruction (0308), the current instruction is placed into the next subsequent box after the box in which the latest source instruction is placed (0309).

OP2-2. If the current instruction is a jump instruction (0312):

a. If the current instruction is a conditional jump (0313), it is placed into the same box as the instruction the condition depends on (0314), e.g. a compare or flag-generating arithmetic instruction. Conditional jumps can be handled that way and moved within the box structure as the code is controlled by the conditional control information added by 0215.

b. If the current instruction is a non-conditional jump (0315), it is placed in the last available box (0316). With that, all boxes (prior to and including the last) are closed (0317). Subsequent instructions cannot be moved into one of those boxes as this would destroy the correct execution of the code. Therefore subsequent instructions are moved into a new set of boxes beginning after the last box.
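Again as a hedged Python sketch on the same simplified representation (the fields srcs, dsts and multicycle are illustrative assumptions), the second pass can be expressed as follows; producers default to box 0 for registers set before the code under analysis:

def optimizer_pass2(code):
    boxes, prod_box, prod_is_load, floor = [[]], {}, {}, 0
    def place(ins, b):
        while len(boxes) <= b:
            boxes.append([])
        boxes[b].append(ins)
        return b
    for ins in code:
        src_boxes = [prod_box.get(r, floor) for r in ins.get("srcs", [])]
        latest = max(src_boxes) if src_boxes else floor
        latest = max(latest, floor)                 # closed boxes stay closed (0317)
        latest_is_load = any(prod_box.get(r, floor) == latest and prod_is_load.get(r)
                             for r in ins.get("srcs", []))
        if ins["op"] == "jump" and not ins.get("cond"):
            b = place(ins, len(boxes) - 1)          # OP2-2.b: into the last box
            floor = len(boxes)                      # subsequent code starts fresh
        elif ins["op"] == "jump":
            b = place(ins, latest)                  # OP2-2.a: with its condition source
        elif ins["op"] == "store" or latest_is_load or ins.get("multicycle"):
            b = place(ins, latest + 1)              # OP2-1.a / OP2-1.ii / 0307a
        else:
            b = place(ins, latest)                  # OP2-1.i: same box as the source
        for r in ins.get("dsts", []):
            prod_box[r], prod_is_load[r] = b, ins["op"] == "load"
    return boxes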

An exemplary respective algorithm operating on assembly code (e.g. CODE1) is shown in Program Listing 2 of the computer program listing appendix. It produces assembly code again (e.g. CODE2) and is written in Perl for easy understanding.

Exemplary Transformation

Either high-level code (e.g. C, C++, Fortran, . . . ) is compiled to assembly code, or plain assembly code is directly taken as an input. Any compiler for a given existing processor architecture can be used.

The assembly code is reordered and grouped into boxes by a hardware generator software tool as described. Each box comprises the respective instructions. Each box has one or more entry points. If at runtime a box is enabled, the correct entry point is chosen depending on the program pointer PP provided when enabling the box. PP is decoded and selects the according instruction as entry point.

FIG. 5 shows a hardware module generated by a hardware generator tool from source assembly code. The hardware module is typically emitted by the hardware generator tool in a hardware description language (e.g. Verilog, VHDL), e.g. either as synthesizable code or a gate netlist. The hardware module is a hardware representation of the software source code. A hardware library provides hardware macros for all used instructions, e.g. for an add instruction a hardware adder is provided, for a load instruction a hardware load unit, etc.

For each instruction the respective hardware macro is instantiated and placed into the respective box's hardware module by the hardware generator software tool.

FIG. 5 shows 3 boxes (0501, 0502, 0503). Each box comprises an instantiated macro (0500) for each of the instructions of the box's assembly code. The instructions, respectively macros, are placed and interconnected in order of their execution sequence. For each register of the register file (for example data registers, address registers, control registers (such as e.g. status), and preferably the program pointer PP), a respective bus is implemented within the box. Operand data is taken from the buses (indicated by ‘o’) and result data is put onto the buses (indicated by ‘x’) according to the source and target registers of each respective instruction. Result data might either be directly connected to the bus, i.e. drive the bus for all subsequent instructions while the bus connection to previous instructions is respectively disconnected (see e.g. 0533, 0534), or inserted into the bus via a multiplexer selecting result data from the respective macro instead of the previous bus information.

For example, instruction 0531 receives operand data from register buses r1 and r15 and puts its result back onto register bus r1; instruction 0532 receives operand data from register buses r0 and r1 and puts its result back onto register bus r15.

The register buses in the box modules may contain pipeline stages (0510) for reducing the overall delay. The insertion of pipeline registers depends on the application's target clock frequency and latency requirements. Pipeline registers are preferably automatically inserted by the hardware generator tool (e.g. 0510).

The register file buses are fed by the register file (0541) located in the main module (0504). In alternative embodiments, the register file might not be located in the main module, but formed by pipeline registers in the box modules, e.g. 0511. Respectively, pipeline registers might be inserted directly at the output of each box module (as exemplarily shown by 0511).

The last instruction of a box typically defines the program pointer PP (0512) to continue with. This is particularly true if the last instruction is a conditional or unconditional jump (or call) instruction. This program pointer is transmitted to the main module for selecting the next subsequent box for execution. If a pipeline register exists at the exit of the box (e.g. 0511), the program pointer is registered too.

All result data from the boxes is fed to the main module, in one embodiment one bus for each register and box. For writing back the result data of each of the boxes into the register file, a multiplexer bank (0542) in the main module (0504) selects the source bus for each register of the register file (0541) according to the currently active box, e.g. multiplexer 0543 for register r0, 0544 for register r1, and 0545 for r15.

Multiplexer 0551 selects the target program pointer (i.e. the address of the next instruction to be executed) (0512) according to the currently active box. The target program pointer is decoded by a decoder (0552) and stored in a register (0553). The stored information enables the box comprising the respective instruction according to the target program pointer of the current execution in the next execution cycle (e.g. clock cycle) and selects the result data and target program pointer of the next execution via the multiplexer bank 0542 and multiplexer 0551 respectively.
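Functionally, the interplay of multiplexer bank 0542, multiplexer 0551, decoder 0552 and register 0553 behaves like the following cycle-level Python model (a behavioural sketch only, not the generated netlist):

def main_module_step(boxes, state):
    # boxes: maps a box's entry address to a function taking the register
    # file (a dict) and returning (result_regs, next_pp) for one execution.
    # state["active"] models register 0553, state["regs"] the register file 0541.
    result_regs, next_pp = boxes[state["active"]](state["regs"])
    state["regs"].update(result_regs)   # multiplexer bank 0542: write-back
    state["active"] = next_pp           # mux 0551 + decoder 0552 -> register 0553
    return state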

In addition to the connectivity of the boxes shown in FIG. 5, other buses might be present, e.g. for interfacing to memory (such as caches, tightly coupled memory, etc.) and/or peripherals.

Depopulated Buses

Register buses not used within a box might be eliminated within the box for reducing the hardware overhead. The respective registers of the register file (0541) are disabled when writing back result data from the respective box module.

In-Line Conditional Execution

Conditional execution is supported within boxes if no branching is required. The ARM instruction set, for example, supports respective operations by the condition field defined for at least some instructions; see [11]. The respective status information is transported via the box-internal buses, similar to the register data. It shall be explicitly mentioned that the register file 0541 also comprises a status register, with a multiplexer selecting the source box, similar to the data registers. A conditionally executed macro (0500) evaluates the status information from the status bus according to its condition setting (e.g. the condition field in the ARM instruction set, see [11]). If the condition is met, the result data of the macro is written onto the respective bus according to the target register. The result data is conditionally written to the bus via a multiplexer in the bus (e.g. in place of 0534 or 0533 respectively). Alternatively, the respective bus data is bypassed via a multiplexer inside the macro, in which case the target bus becomes an additional input to the macro just for providing the data to be bypassed in case the condition is not met.

Effect of Managing the Program Pointer PP

The active management of the PP, by transferring it between the boxes and the main module and in some embodiments even within boxes, allows for flexible processing of jumps; depending on the implementation, even jump tables and runtime-calculated address values are supported.

In some embodiments boxes (e.g. 0501, 0502, 0503) and/or macros (e.g. 0500) might even compare their code address range and/or code address to the currently transmitted program pointer PP and become active when the current program pointer meets the box's and/or macro's set point.

Applicability on Processor Hardware

The inventive optimizer is applicable to processors by integrating it into the processor hardware. The optimizer reduces the complexity and power dissipation of processors, particularly Out-of-Order processors, as, for example, large and power-consuming Reorder Buffers are replaced by the inventive technology.

Reference is made to FIG. 4. An instruction loader generates addresses (0402) to the instruction memory hierarchy (e.g. the Level-2 cache and/or Level-3 cache and/or system main memory) for fetching instructions (0403) to be executed. The loader is driven by the program pointer and loads instructions ahead of their execution.

In a preferred embodiment the optimizer of FIG. 4 operates on complete routines and/or functions for best optimization results. The loader must ensure that a complete routine or function is loaded for optimization. For example, a function shall be defined as a code block starting with the entry (first address of a routine) and ending with a return (ret) instruction. Depending on the processor implementation and its specific instruction set, other conventions and implementations (particularly of the instruction set) are possible.

In a very basic implementation the loader starts loading instructions with the first address of code not loaded yet and/or not being present in any of the code buffers and/or caches downstream of the loader, i.e. between the loader and the execution units of the processor, until the first return instruction leaving the routine. However, it cannot be guaranteed that the first return instruction is the only one leaving the routine. Due to conditional execution, other return instructions may exist, so that the first found return instruction might not be a sufficient criterion.

In an ideal implementation it must be ensured that the complete software routine and/or function is loaded. Various implementations are possible, for example:

a) The loader uses a jump directory (0404) in which each target address of detected jump instructions is stored, and performs the following steps (a sketch of this variant follows below):

LD1. Whenever a target address is reached and read by the loader, it is removed from the jump directory.

LD2. If a return instruction is detected but the jump directory is not empty, obviously more code must exist belonging to the function. In that case the loader continues loading instructions.

LD3. If a return instruction is detected and the jump directory is empty, no more code exists belonging to the function. The loader stops loading code.

LD4. If a non-conditional jump backwards in the code is detected and the jump directory is empty, no more code exists belonging to the function. The loader stops loading code.

b) The optimizer (0405) performs the steps of a) and instructs the loader respectively (0406). It might use a respectively amended jmpstk (0407) to perform the functions of the jump directory.
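The following Python sketch illustrates variant a); fetch is a hypothetical helper returning a decoded instruction for an address, and the instruction fields are simplified assumptions:

def load_function(fetch, entry):
    jump_dir, code, addr = set(), [], entry   # jump_dir models directory 0404
    while True:
        ins = fetch(addr)
        code.append(ins)
        jump_dir.discard(addr)                           # LD1
        if ins["op"] in ("jump", "call") and ins.get("target", -1) > addr:
            jump_dir.add(ins["target"])                  # remember pending targets
        if ins["op"] == "ret" and not jump_dir:
            return code                                  # LD3
        if (ins["op"] == "jump" and not ins.get("cond")
                and ins.get("target", addr) < addr and not jump_dir):
            return code                                  # LD4
        addr += 1                                        # LD2: otherwise keep loading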

The loader forwards the loaded instructions to a first optimizer pass (0405) (e.g. OP1 as described above). The optimizer pass uses a stack memory (0407) to implement the jmpstk.

In a register file (0408), for each register the latest source instruction is stored.

The results of the first optimizer pass are written into a buffer memory (0409). Various embodiments are possible, for example: In one embodiment, the buffer memory is a processor-internal buffer. The loader gets the instructions from a Level-1 Cache.

In another embodiment, the loader and first optimizer pass are located in front of the Level-1 Instruction Cache. The loader gets the instructions from a Level-2 Cache. The Level-1 Instruction Cache is used as buffer memory 0409.

The second optimizer pass (0410) gets the instructions from the buffer memory (0409) and writes the optimized instructions into a second buffer memory (0412). The second optimizer pass might use an instruction box reference memory (0411) for looking up the box in which a specific instruction has been placed. The instruction's address is translated into a memory reference, under which the identification can be found of the box into which the respective instruction has been placed.

The Execution Units (EX) receive the instructions (or microcode at this stage) from the second buffer.

The second buffer can be located at various positions, depending on the implementation. For example:

In one embodiment, the second buffer is the Level-1 Instruction Cache; in this case, the first buffer might be a Level-2 Instruction Cache.

In another embodiment, the second buffer is a Trace Cache, replacing and/or extending the functionality of the Trace Cache. During the second optimizer pass, instructions are translated into microcode and possibly fused or split. For details of the Trace Cache, reference is made to the Pentium-4 Processor, particularly to [10], [11] and [12]. [10], [11] and [12] are entirely incorporated by reference into this patent for full disclosure; claims of this patent may comprise features of [10], [11] and [12]. For further details on microcode fusion and splitting, particularly in conjunction with the inventive processor technology, reference is made to [4] and [5].

In yet another embodiment, the second buffer might be in the position of a ReOrder Buffer (ROB), replacing and/or extending its functionality. Respectively, the first buffer might be a Trace Cache (if implemented) or a Level-1 Instruction Cache.

In yet another embodiment, the second buffer might replace and/or extend the Reservation Stations of a processor.

It shall be noted that FIG. 4 is based on the assumption that the exemplary 2-pass optimizer of FIG. 2 and FIG. 3 is implemented. Depending on the algorithm, one-pass or multi-pass (with 2 or more passes) optimizers might be used. Respectively, and as is obvious for one skilled in the art, the block diagram of FIG. 4 is modified. For example, a one-pass optimizer might not need the buffer 0409, while multi-pass optimizers might have one buffer implemented between each of the passes.

Ideally the structure according to FIG. 4 reads ahead of the Program Pointer. The capability of following the program flow by analyzing the potential jump targets allows for intelligent prefetching of code. For example, a complete function can be prefetched without further consideration (e.g. by the programmer and/or compiler and/or other tools). The inventive steps not only allow the complete function or routine to be prefetched; prefetching also stops once the function or routine has been loaded, without prefetching subsequent memory locations (e.g. other parts of the code) and thus without wasting energy and memory bandwidth.

In one advanced embodiment, function calls within the prefetched function or routine might be detected. For example, call instructions could be placed in a dedicated call stack memory, e.g. by the loader or optimizer, or respectively marked in the jmp directory (0404) and/or jmpstk (0407). Preferably after the complete function has been (pre-)fetched, the called functions are prefetched, so that they are available prior to calling them. It might be beneficial to prioritize the prefetching, e.g. by prefetching functions appearing early in the code and/or those which are non-conditionally called first. Then conditionally called functions are fetched, while the conditions might be evaluated for the likelihood of their execution. Several evaluation methods are possible, for example:

a) based on statistics (e.g. stored together with the binary);

b) jump instructions including a likelihood indicator of their execution (e.g. 2 bits: 00 = unlikely, 01 = rather unlikely, 10 = rather likely, 11 = likely);

c) based on how the condition is expressed in the binary: the condition-false path might be defined as likely, while the condition-true path might be defined as unlikely (or vice versa);

d) a dummy instruction located before the (function call) instruction indicating the likelihood of the execution of the call; the instruction might be evaluated only by the loader/optimizer and removed afterwards.

Obviously the evaluations might also be very useful for other conditional jumps (other than calls) for providing optimization information to the processor, e.g. for loop optimization and/or speculative execution.

Known from [5], which is expressly incorporated by reference, are optimizations for managing constant data. In addition to the optimizations discussed in [5], processors having buffers for storing instructions which are frequently accessed may replace instructions loading constant data, once the constant data has been loaded from memory, with the constant data itself. If the processor comprises a Trace Cache as previously described, in one preferred embodiment the constant data is written into the Trace Cache, replacing the original instruction. In other embodiments, instructions in the Reorder Buffer and/or Reservation Station might be replaced with the constant data. By replacing the load instruction with the actual data, the instruction will not be executed anymore and therefore the respective time and energy is saved.

It shall also be further noted, particularly referring to [5], that some processor instruction sets support loading data from a Program Counter (PC) relative address, e.g. ARM's LDR <Rd>, [PC, #<immed8>*4]. As data within the code segment (relative to the program pointer) is typically generated at compile and/or link time, such data is constant. Therefore PC-relative load instructions might be recognized and/or used as dedicated load-constant instructions (e.g. ldc).
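As an illustrative sketch (assuming A32 semantics, where a PC read returns the instruction address plus 8; the instruction representation is hypothetical), such recognition and constant folding might look like this:

def fold_pc_relative_load(ins, code_segment):
    # Recognize "ldr rX, [pc, #imm]": PC-relative data lives in the code
    # segment, is fixed at compile/link time and therefore constant, so
    # the load may be treated as a dedicated load-constant (ldc).
    if ins["op"] == "ldr" and ins["base"] == "pc":
        value = code_segment[ins["addr"] + 8 + ins["offset"]]
        return {"op": "ldc", "dst": ins["dst"], "value": value}
    return ins

code_segment = {0x1008 + 8 + 4: 0xCAFE}   # constant pool entry
print(fold_pc_relative_load(
    {"op": "ldr", "dst": "r0", "base": "pc", "offset": 4, "addr": 0x1008},
    code_segment))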

Prior Art

Trace Caches, which store a transformed representation of original binary code, are known in the prior art. Reference is made to the Pentium-4 Processor, particularly to [10], [11] and [12]. [10], [11] and [12] are entirely incorporated by reference into this patent for full disclosure; claims of this patent may comprise features of [10], [11] and [12]. However, Trace Caches are significantly different from this invention. Some major differences are:

1. Trace Caches store decoded instructions in microcode format. The code stored in Trace Caches is not optimized, but solely decoded (transformed) from a dense binary opcode representation into a large vector controlling the processor hardware. The inventive caches, in contrast, store modified code, in which e.g. the sequence of instructions is rearranged, and/or source and/or target registers are changed (even replaced by Execution Units, e.g. ALUs in the ALU-Block of a ZZYX processor), and/or instructions are combined and/or grouped to form code blocks (e.g. Catenas; reference is made to [1] and [3], both incorporated by reference into this patent for full disclosure; claims of this patent may comprise features from [1] and [3]) which are executed and/or scheduled simultaneously. The optimizer routine may also perform completely different or additional optimizations of the binary, the results of which are stored in the inventive cache.

2. Trace Caches store only code which has already been executed. Only code which has been previously addressed by the processor's program pointer, respectively fetched and decoded, is present in a Trace Cache. This invention, however, operates (timely and locally) in front of the program pointer. While it is under control of the program pointer (e.g. jumps affect the next code range to be fetched, optimized and stored), the loading and optimization of code is done ahead of addressing the code by the program pointer. The inventive caches comprise code which has not yet been addressed by the processor's program pointer.

While Trace Caches store only code which has already been executed, the inventive caches store complete code blocks, e.g. subroutines, functions, loops, inner loops etc.

Exemplary Embodiment

FIG. 6-1 shows an exemplary embodiment of a ZZYX core. The core is ARM compatible; has 6 Load Units, capable of operating in parallel; has 4 Store Units, capable of operating in parallel; has 8 ALUs arranged in two columns; and has an efficient network, providing top-down dataflow and access to load/store units.

FIG. 6-2 shows an exemplary loop: The code is emitted by the compiler in a structure which is in compliance with the instruction decoder of the processor. The instruction decoder (e.g. the optimizer passes 0405 and/or 0410) recognizes code patterns and sequences; and (e.g. a rotor, see [4] FIG. 14 and/or [1] FIG. 17a and FIG. 17b) distributes the code accordingly to the function units (e.g. ALUs, control, Load/Store, etc.) of the processor.

Referring to FIG. 6-2:

1. The code is plain ARM code, executable on any ARM core. Note: the registers bp[0]-[3] relate to any available register r[n].

2. The compiler generates a predefined pattern: a) loop header/footer; b) load/store striding; c) conditional store; and d) code generator and instruction decoder/placer using the same algorithm for placing instructions.

3. The instruction decoder detects the pattern and issues code accordingly.

4. Register dependencies are resolved and mapped to the network.

The code of the exemplary loop shown in FIGS. 6-2, 6-3, 6-4, 6-5, and 6-6 is also provided below for better readability:

mov r1, r1           ; Switch on optimization
mov r13, #0
loop:
cmp r13, #7
beq exit
ldr r2, [bp0], #1    ; old_sm0
ldr r3, [bp0], #1    ; old_sm1
ldr r4, [bp1], #1    ; bm00
add r0, r2, r4
ldr r4, [bp1], #1    ; bm10
add r1, r3, r4
ldr r4, [bp1], #1    ; bm01
add r2, r2, r4
ldr r4, [bp1], #1    ; bm11
add r3, r3, r4
cmp r0, r1
movcc r0, r1
str r0, [bp2], #1    ; new_sm0
xor r0, r0, r0       ; dec0 . . .
strbcc r0, [bp3], #1
movcs r0, #1
strbcs r0, [bp3], #1 ; . . . dec0
cmp r2, r3
movcc r2, r3
str r2, [bp2], #1    ; new_sm1
xor r0, r0, r0       ; dec1 . . .
strbcc r0, [bp3], #1
movcs r0, [bp3], #1  ; . . . dec1
add r13, r13, #1
b loop
exit:
mov r0, r0           ; Switch off optimization

The listed code has a structure identical to that in the figures, for easy referencing.

FIG. 6-3 shows the detection of the loop information (header and footer) and the respective setup of/microcode issue to the loop control unit. At the beginning of the loop, the code pattern for the loop entry (e.g. header) is detected (1) and the respective instruction(s) are transferred to a loop control unit managing loop execution. At the end of the loop, the pattern of the according loop exit code (e.g. footer) is detected (1) and the respective instruction(s) are transferred to a loop control unit. For details on loop control, reference is made to [1], in particular to “loop control” and “TCC”.

The detection of the code pattern might be implemented in 0405 and/or 0410. In particular, microcode fusion techniques might apply for fusing the plurality of instructions of the respective code patterns into (preferably) one microcode.

FIG. 6-3, Arrow 1: enter Loop-Acceleration Mode and initialize LoopControl. Arrow 2: final setup of LoopControl, start execution, terminate Loop-Acceleration Mode after the exit criterion is met.

FIG. 6-4 shows the setup of/microcode issue to the Load Units in accordance with detected instructions. Each instruction is issued to a different load unit and can therefore be executed independently and in particular concurrently. As the second shown instruction (ldr r3, [bp0], #1) depends on the same base pointer (bp0) as the first shown instruction (ldr r2, [bp0], #1), the address calculation of the respective two pointers must be adjusted to compute correctly within a loop when independently calculated. For example: Both pointers increment by an offset of 1. If sequentially executed, both addresses, the address of r2 and the address of r3, would move in steps of 2, as the instructions add a value of 1 two times per iteration. But, executed in parallel and in different load units, both addresses would only move in steps of 1. Therefore the offset of both instructions must be adjusted to 2, and furthermore the base address of the second instruction (ldr r3, [bp0], #1) must be adjusted by an offset of 1. Respectively, when detecting and issuing the second instruction, the offset of the first must be adjusted (as shown by the second arrow of 2). Accordingly (but not shown), the address generation of the other load and store instructions (e.g. relative to base pointers bp1, bp2 and bp3) must be adjusted. A sketch of this adjustment follows below.

FIG. 6-4, Arrow 1: setup Load Unit 0. Arrow 2: setup Load Unit 1 and modify striding of Load Unit 0. Remaining load/store units are issued respectively. Register sources are added to the register library.

FIG. 6-5 shows the setup of, and microcode issue to, the store units in accordance with detected instruction patterns and/or macros. The store units support complex store functions which conditionally store one of a set of immediate values depending on status signals (e.g. the processor status). The shown code stores either a zero value (xor r0, r0, r0) or a one (movcs r0, #1) to the address of base pointer bp3, depending on the current status. The conditional mnemonic extensions ‘cc’ and ‘cs’ are used respectively. For details on the ARM instruction set see [13]. As described before, the instruction decoder (e.g. the optimizer passes 0405 and/or 0410) recognizes the code patterns and sequences, which might be fused, and the joint information is transmitted (1 and 2) as a microcode to the store unit.
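A store unit of the described kind might be modeled as follows; the structure and function names are hypothetical and only illustrate storing one of two immediates selected by the carry flag, as fused from the xor/strbcc/movcs/strbcs sequence:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical conditional-immediate store unit. */
typedef struct {
    uint8_t  *addr;      /* current store address         */
    uint32_t  stride;    /* address increment per store   */
    uint8_t   imm_cc;    /* value stored when carry clear */
    uint8_t   imm_cs;    /* value stored when carry set   */
} store_unit_t;

/* One activation of the store unit: pick the immediate selected by
 * the carry flag and advance the address. */
void store_unit_fire(store_unit_t *su, bool carry)
{
    *su->addr = carry ? su->imm_cs : su->imm_cc;  /* strbcs / strbcc */
    su->addr += su->stride;
}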

FIG. 6-5, Arrows 1 and 2: conditional store patterns are recognized and the store units are set up respectively.

FIG. 6-6 shows the issue of the instructions dedicated to the ALUs. The instructions are issued according to their succession in the binary code. The issue sequence is such that first a row is filled and then issuing continues with the first column of the next lower row. If an instruction to be issued depends on a previously issued instruction such that, due to network limitations, it must be located in a lower row to be capable of receiving the required results from another ALU, it is placed accordingly (see FIG. 6-6, 6). Yet, code issue afterwards continues with the first free ALU in reading order, so the issue pointer moves up again (see FIG. 6-6, 7); a sketch of this placement rule is given after the figure description below. For details on code distribution, reference is made to [1] and [4] (both incorporated by reference for full disclosure), e.g. a rotor; see [4] FIG. 14 and/or [1] FIGS. 17a and 17b.

FIG. 6-6, Arrows 1 to 5: issue instructions in reading order. Arrow 6: depending on arrow 5, issue one ALU below. Arrow 7: resume with the first free ALU in reading order. Arrow 8: depending on arrow 7, issue one ALU below. The network is set up according to the register directory; there are no transactions through the register file.
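The placement rule might be sketched as follows, assuming a small rectangular ALU array; ROWS, COLS, and all data structures are illustrative, not part of the original disclosure. An instruction is placed at the first free ALU in reading order, but never above the row required for it to receive its operands through the network:

#include <stdbool.h>

#define ROWS 4
#define COLS 4

typedef struct { int row, col; } slot_t;

/* Place one instruction. src_rows[] holds the rows of the ALUs that
 * produce this instruction's operands (empty if operands come from
 * the network/register sources directly). */
slot_t place_insn(bool occupied[ROWS][COLS],
                  const int src_rows[], int nsrc)
{
    /* Network limitation: a result produced in row r is assumed to be
       receivable only in lower rows, so start the search one row below
       the lowest producer of this instruction's operands. */
    int min_row = 0;
    for (int i = 0; i < nsrc; i++)
        if (src_rows[i] + 1 > min_row)
            min_row = src_rows[i] + 1;

    /* Reading order: fill a row left to right, then continue with the
       next lower row. The search restarts at the top for every
       instruction, so the issue pointer "moves up again" and earlier
       gaps are reused. */
    for (int r = min_row; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            if (!occupied[r][c]) {
                occupied[r][c] = true;
                return (slot_t){ r, c };
            }
    return (slot_t){ -1, -1 };   /* array full: no free ALU */
}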

FIG. 6-7 shows an example of an enhanced instruction set providing optimized ZZYX instructions. Shown is the same loop code, but the complex code macros requiring fusion are replaced by instructions which were added to the ARM instruction set:

The lsuld instruction loads bytes (lsuldb) or words (lsuldw) from memory. Complex address arithmetic is supported by the instruction, in which an immediate offset is added (+= offset) to a base pointer, which might then be sequentially incremented by a specific value (^value) with each processing cycle.

The lsust instruction stores bytes (lsustb) or words (lsustw) to memory. The address generation operates as for the lsuld instruction.
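Under the obvious reading of these mnemonics, the address generation of lsuld/lsust can be modeled as a base pointer plus an immediate offset, advanced by the ^ step once per processing cycle. A minimal sketch with illustrative names (any element-size scaling, such as the *4 seen in the listing, would fold into the step value):

#include <stdint.h>

/* Illustrative model of the lsuld/lsust address generator. */
typedef struct {
    uint32_t addr;   /* current effective address     */
    uint32_t step;   /* per-cycle increment (^ value) */
} agen_t;

void agen_init(agen_t *a, uint32_t bp, uint32_t offset, uint32_t step)
{
    a->addr = bp + offset;   /* the "bp += offset" part of the mnemonic */
    a->step = step;          /* element-size scaling folds in here      */
}

uint32_t agen_next(agen_t *a)   /* one address per processing cycle */
{
    uint32_t cur = a->addr;
    a->addr += a->step;
    return cur;
}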

A for instruction defines loops, setting the start value, end value, and step width, all in a single mnemonic. The endfor instruction correspondingly indicates the end of the loop code.
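Read this way, the loop frame of FIG. 6-7 corresponds to a counted loop; illustratively, in C:

#include <stdio.h>

/* Illustrative reading of "for 0,<=7,+1 ... endfor": start value 0,
 * exit condition <= 7, step +1. The body stands for the ALU
 * instructions between for and endfor. */
int main(void)
{
    for (int i = 0; i <= 7; i += 1)
        printf("body iteration %d\n", i);
    return 0;
}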

The code shown in FIG. 6-7 is also listed below for better readability:

lsuldw r4, bp0 += ^1        ; old_sm0
lsuldw r5, bp0 += ^1        ; old_sm1
lsuldw r6, bp1 += 0 ^1*4    ; bm00
lsuldw r7, bp1 += 1 ^1*4    ; bm10
lsuldw r8, bp1 += 2 ^1*4    ; bm01
lsuldw r9, bp1 += 3 ^1*4    ; bm11
lsustw r0, bp2 += 0 ^2      ; new_sm0
lsustw r2, bp2 += 1 ^2      ; new_sm1
lsustb s0, bp3 += 0 ^2      ; dec0 (rss!)
lsustb s1, bp3 += 1 ^2      ; dec1 (rss!)
for 0,<=7,+1
  add r0, r4, r6
  add r1, r5, r7
  add r2, r4, r8
  add r3, r5, r9
  cmp r0, r1
  cmp r2, r3
  movle r0, r1
  movle r2, r3
endfor

The listed code is identical in structure to the code in the figure, for easy cross-referencing.

In a preferred embodiment, the instruction set is enhanced with dedicated instructions, e.g.:

-   lsuld(w/b), lsust(w/b): memory instructions optimized for block transfers, including 2D/3D address generation, e.g., striding;
-   for, endfor: loop instructions managing loop control; and
-   advanced (thumb) instructions in Loop-Acceleration Mode increasing code density.

Literature and Patents or Patent Applications Incorporated by Reference:

The following references are incorporated by reference into the patent for complete disclosure. It is expressly noted that claims may comprise elements of any reference incorporated into the specification:

[1] ZZYX07: PCT/EP 2009/007415 (WO2010/043401); Vorbach

[2] ZZYX08: PCT/EP 2010/003459 (WO2010/142432); Vorbach

[3] ZZYX09: PCT/EP 2010/007950; Vorbach

[4] ZZYX10: PCT/EP 2011/003428; Vorbach

[5] ZZYX11: DE 11 006 698.2; Vorbach

[6] http://www.videolan.org/developers/x264.html: VideoLAN, VLC media player and x264 are trademarks registered (or in registration process) by the VideoLAN non-profit organization. The software is licensed under the GNU General Public License.

[7] PACT04: U.S. Pat. No. 7,028,107; Vorbach et al.

[8] PACT10: U.S. Pat. No. 6,990,555; Vorbach et al.

[9] PACT08: U.S. Pat. No. 7,036,036; Vorbach et al.

[10] The Unabridged Pentium 4: IA32 Processor Genealogy; Tom Shanley; Mindshare Inc.; ISBN 0-321-25656-X

[11] Trace Cache: a Low Latency Approach to High Bandwidth Instruction Fetching; Eric Rotenberg, Computer Science Dept., Univ. of Wisconsin; Steve Bennett, Intel Corporation; James E. Smith, Dept. of Elec. and Comp. Engr., Univ. of Wisconsin; Copyright 1996 IEEE. Published in the Proceedings of the 29th Annual International Symposium on Microarchitecture, Dec. 2-4, 1996, Paris, France.

[12] U.S. Pat. No. 5,381,533; Peleg et al. (DYNAMIC FLOW INSTRUCTION CACHE MEMORY ORGANIZED AROUND TRACE SEGMENTS INDEPENDENT OF VIRTUAL ADDRESS LINE)

[13] ARM Architecture Reference Manual; Copyright © 1996-1998, 2000, 2004, 2005 ARM Limited. All rights reserved. ARM DDI 0100I

APPENDIX A

0000007c <x264_encoder_delayed_frames>:
0000007c: e92d40f8  push {r3, r4, r5, r6, r7, lr}
0000007e: e59f3114  ldr r3, [pc, #276] ; (194 <x264_encoder_delayed_frames+0x118>)
00000080: e2904000  adds r4, r0, #0
00000082: e7900003  ldr r0, [r0, r3]
00000084: e3500001  cmp r0, #1
00000086: ca000000  bgt.n 8a <x264_encoder_delayed_frames+0xe>
00000088: ea000082  b.n 190 <x264_encoder_delayed_frames+0x114>
0000008a: e3b030aa  movs r3, #170 ; 0xaa
0000008c: e1b05103  lsls r5, r3, #2
0000008e: e1b00100  lsls r0, r0, #2
00000090: e2506004  subs r6, r0, #4
00000092: e0947005  adds r7, r4, r5
00000094: e1b01e06  lsls r1, r6, #28
00000096: e59f2100  ldr r2, [pc, #256] ; (198 <x264_encoder_delayed_frames+0x11c>)
00000098: e5976000  ldr r6, [r7, #0]
0000009a: e3b03004  movs r3, #4
0000009c: e1b01f21  lsrs r1, r1, #30
0000009e: e7965002  ldr r5, [r6, r2]
000000a0: e1530000  cmp r3, r0
000000a2: 0a00003d  beq.n 120 <x264_encoder_delayed_frames+0xa4>
000000a4: e3510000  cmp r1, #0
000000a6: 0a00001c  beq.n e2 <x264_encoder_delayed_frames+0x66>
000000a8: e3510001  cmp r1, #1
000000aa: 0a000010  beq.n ce <x264_encoder_delayed_frames+0x52>
000000ac: e3510002  cmp r1, #2
000000ae: 0a000006  beq.n be <x264_encoder_delayed_frames+0x42>
000000b0: e3b060ab  movs r6, #171 ; 0xab
000000b2: e1b01106  lsls r1, r6, #2
000000b4: e0947001  adds r7, r4, r1
000000b6: e5973000  ldr r3, [r7, #0]
000000b8: e7936002  ldr r6, [r3, r2]
000000ba: e3b03008  movs r3, #8
000000bc: e0955006  adds r5, r5, r6
000000be: e3b070aa  movs r7, #170 ; 0xaa
000000c0: e0941003  adds r1, r4, r3
000000c2: e1b06107  lsls r6, r7, #2
000000c4: e0917006  adds r7, r1, r6
000000c6: e5971000  ldr r1, [r7, #0]
000000c8: e2933004  adds r3, #4
000000ca: e7917002  ldr r7, [r1, r2]
000000cc: e0955007  adds r5, r5, r7
000000ce: e3b010aa  movs r1, #170 ; 0xaa
000000d0: e0946003  adds r6, r4, r3
000000d2: e1b07101  lsls r7, r1, #2
000000d4: e0961007  adds r1, r6, r7
000000d6: e5916000  ldr r6, [r1, #0]
000000d8: e2933004  adds r3, #4
000000da: e7961002  ldr r1, [r6, r2]
000000dc: e0955001  adds r5, r5, r1
000000de: e1530000  cmp r3, r0
000000e0: 0a00001e  beq.n 120 <x264_encoder_delayed_frames+0xa4>
000000e2: e3b070aa  movs r7, #170 ; 0xaa
000000e4: e0941003  adds r1, r4, r3
000000e6: e1b06107  lsls r6, r7, #2
000000e8: e0917006  adds r7, r1, r6
000000ea: e5976000  ldr r6, [r7, #0]
000000ec: e3b070aa  movs r7, #170 ; 0xaa
000000ee: e7961002  ldr r1, [r6, r2]
000000f0: e1b07107  lsls r7, r7, #2
000000f2: e0955001  adds r5, r5, r1
000000f4: e2931004  adds r1, r3, #4
000000f6: e0946001  adds r6, r4, r1
000000f8: e0966007  adds r6, r6, r7
000000fa: e5966000  ldr r6, [r6, #0]
000000fc: e0941001  adds r1, r4, r1
000000fe: e7966002  ldr r6, [r6, r2]
00000100: e297700c  adds r7, #12
00000102: e0955006  adds r5, r5, r6
00000104: e3b060ab  movs r6, #171 ; 0xab
00000106: e1b06106  lsls r6, r6, #2
00000108: e0911006  adds r1, r1, r6
0000010a: e5911000  ldr r1, [r1, #0]
0000010c: e7916002  ldr r6, [r1, r2]
0000010e: e0941003  adds r1, r4, r3
00000110: e0955006  adds r5, r5, r6
00000112: e0916007  adds r6, r1, r7
00000114: e5961000  ldr r1, [r6, #0]
00000116: e2933010  adds r3, #16
00000118: e7917002  ldr r7, [r1, r2]
0000011a: e0955007  adds r5, r5, r7
0000011c: e1530000  cmp r3, r0
0000011e: 1affffe0  bne.n e2 <x264_encoder_delayed_frames+0x66>
00000120: e3b01096  movs r1, #150 ; 0x96
00000122: e1b03181  lsls r3, r1, #3
00000124: e7942003  ldr r2, [r4, r3]
00000126: e29220aa  adds r2, #170 ; 0xaa
00000128: e1b00102  lsls r0, r2, #2
0000012a: e7904004  ldr r4, [r0, r4]
0000012c: e59f706c  ldr r7, [pc, #108] ; (19c <x264_encoder_delayed_frames+0x120>)
0000012e: e7943007  ldr r3, [r4, r7]
00000130: e5936000  ldr r6, [r3, #0]
00000132: e3560000  cmp r6, #0
00000134: 0a000004  beq.n 140 <x264_encoder_delayed_frames+0xc4>
00000136: e2933004  adds r3, #4
00000138: e8b30001  ldmia r3!, {r0}
0000013a: e2955001  adds r5, #1
0000013c: e3500000  cmp r0, #0
0000013e: 1afffffb  bne.n 138 <x264_encoder_delayed_frames+0xbc>
00000140: e59f605c  ldr r6, [pc, #92] ; (1a0 <x264_encoder_delayed_frames+0x124>)
00000142: e7940006  ldr r0, [r4, r6]
00000144: e2900035  adds r0, #53 ; 0x35
00000146: e29000ff  adds r0, #255 ; 0xff
00000148: ebfffffe  bl 0 <pthread_mutex_lock>
0000014c: e7940006  ldr r0, [r4, r6]
0000014e: e2900024  adds r0, #36 ; 0x24
00000150: ebfffffe  bl 0 <pthread_mutex_lock>
00000154: e7940006  ldr r0, [r4, r6]
00000156: e29000ac  adds r0, #172 ; 0xac
00000158: ebfffffe  bl 0 <pthread_mutex_lock>
0000015c: e7940006  ldr r0, [r4, r6]
0000015e: e3b020a8  movs r2, #168 ; 0xa8
00000160: e7901002  ldr r1, [r0, r2]
00000162: e5907020  ldr r7, [r0, #32]
00000164: e3b03098  movs r3, #152 ; 0x98
00000166: e0912007  adds r2, r1, r7
00000168: e1b07083  lsls r7, r3, #1
0000016a: e7901007  ldr r1, [r0, r7]
0000016c: e29000ac  adds r0, #172 ; 0xac
0000016e: e0923001  adds r3, r2, r1
00000170: e0935005  adds r5, r3, r5
00000172: ebfffffe  bl 0 <pthread_mutex_unlock>
00000176: e7940006  ldr r0, [r4, r6]
00000178: e2900024  adds r0, #36 ; 0x24
0000017a: ebfffffe  bl 0 <pthread_mutex_unlock>
0000017e: e7940006  ldr r0, [r4, r6]
00000180: e2900035  adds r0, #53 ; 0x35
00000182: e29000ff  adds r0, #255 ; 0xff
00000184: ebfffffe  bl 0 <pthread_mutex_unlock>
00000188: e2950000  adds r0, r5, #0
0000018a: e8bd00f8  pop {r3, r4, r5, r6, r7}
0000018c: e8bd0002  pop {r1}
0000018e: e12fff11  bx r1
00000190: e3b05000  movs r5, #0
00000192: eaffffcb  b.n 12c <x264_encoder_delayed_frames+0xb0>
00000194: 00000504  .word 0x00000504
00000198: 000004ac  .word 0x000004ac
0000019c: 00003ae0  .word 0x00003ae0
000001a0: 00007d20  .word 0x00007d20

CLAIMS

1-8. (canceled)
9. A method for translating high-level software code into a hardware representation, the method comprising: analyzing the high-level software code by analyzer software; splitting the high-level software code into a plurality of blocks, including mapping a plurality of software instructions of the high-level software code to hardware routines, one or more blocks of the plurality of blocks comprising one main routine and at least one subordinate code block, the plurality of blocks being defined by jump and call instructions to code sections or subroutines of the high-level software code, each block of the plurality of blocks having a unique identifier, wherein software instructions in each block are mapped to hardware functions and are arranged in an execution order, at least some of the software instructions in each block being configured to execute at least one of concurrently and asynchronously, wherein particular instructions of the software instructions that depend on other instructions of the high-level software code receive input data from results produced by the other instructions, wherein the plurality of blocks are configured such that, upon completion of execution of the software instructions in a respective block, the respective block returns a pointer matching the unique identifier of a next block to execute; and generating the hardware representation based on the plurality of blocks.